The objective of this project is to predict a movie's revenue from historical data on movie revenues and performance at the global box office.
Such a prediction would be useful for optimization across many stages of movie planning and production: selection of actors, crew, and locations, production spend, marketing spend, logistics, and so on. Predicting revenue potential would enable movie production companies to make wiser investment decisions and produce movies with plots relevant to society, higher entertainment value, and ultimately greater benefit for all parties involved.
In this public competition hosted by Kaggle, I am presented with metadata on past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.
The data is downloaded from https://www.kaggle.com/c/tmdb-box-office-prediction/data
Since the dataset is small, I made a local copy and checked it into git along with the project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from tqdm import tqdm
from datetime import datetime
import time
import json
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from wordcloud import WordCloud
import eli5
import shap
shap.initjs()
warnings.filterwarnings("ignore")
%matplotlib inline
Review the data structure to get a feel for the data types and null values.
train = pd.read_csv('data/train.csv')
train.info()
There are 3000 records in the dataset. Looking at the data types and the null values in several of the columns, data exploration and exploratory data analysis will be an interesting challenge.
Review test data set as well...
test = pd.read_csv('data/test.csv')
test.info()
Let us jump into some exploratory data analysis and look at a few records to get a feel for the actual data.
We have JSON key-value pairs, free-form text, categorical and continuous variables, as well as release dates. A dream come true for an aspiring ML engineer!
train.head()
Descriptive stats in train and test datasets.
train.describe(include='all')
test.describe(include='all')
Count the missing values in each column of both the train and test datasets. This gives an early indication of which columns need handling for missing values and suggests strategies for doing so.
train.isna().sum()
test.isna().sum()
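The two `isna()` counts can also be lined up side by side for easier comparison. A small sketch with toy frames (hypothetical columns, not the real schema):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train/test frames (hypothetical columns)
tr = pd.DataFrame({"budget": [0, 1, np.nan], "tagline": [None, "a", "b"]})
te = pd.DataFrame({"budget": [np.nan, np.nan], "tagline": ["c", None]})

# One table of missing-value counts per column for both datasets
na_summary = pd.concat([tr.isna().sum(), te.isna().sum()],
                       axis=1, keys=["train", "test"])
```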
Using pandas_profiling to accelerate data exploration. For details on this module -> https://pandas-profiling.github.io/pandas-profiling/docs/
And a good article discussing advantages of using such package for data science efforts -> https://towardsdatascience.com/a-better-eda-with-pandas-profiling-e842a00e1136
import pandas_profiling
train.profile_report(style={'full_width':True})
Joint plots to start visualizing relationships between numeric variables...
Budget vs Revenue
sns.jointplot(x="budget", y="revenue", data=train, height=10, ratio=5, color="b")
plt.show()
Popularity vs Revenue
sns.jointplot(x="popularity", y="revenue", data=train, height=10, ratio=5, color="b")
plt.show()
Runtime vs Revenue
sns.jointplot(x="runtime", y="revenue", data=train, height=10, ratio=5, color="g")
plt.show()
Revenue Distribution
The majority of movies sit near the zero line, with 75% of them under $68 million. So we need to apply log scaling to make the distribution conducive to analysis and modeling.
sns.distplot(train.revenue)
train.revenue.describe()
Apply the log1p function to handle movies with zero revenue, which would break a plain log transform.
train['logRevenue'] = np.log1p(train['revenue'])
sns.distplot(train['logRevenue'])
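As a sanity check on the transform itself: `log1p` maps zero revenue to exactly zero, and `expm1` inverts it, which is why it is safe where a plain `log` would blow up. A minimal sketch:

```python
import numpy as np

# log1p(x) = log(1 + x): zero-revenue movies map to 0 instead of -inf
revenues = np.array([0.0, 100.0, 68_000_000.0])
log_rev = np.log1p(revenues)

# expm1 is the exact inverse, used later to turn predictions back into dollars
recovered = np.expm1(log_rev)
```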
train['release_date'].head()
Since only the last two digits of the year are provided, let us manipulate the date to recover the correct year.
# Split release date to individual parts
train[['release_month','release_day','release_year']]=train['release_date'].str.split('/',expand=True).replace(np.nan, -1).astype(int)
train['release_year'].max()
train.loc[train['release_year'] <= 19, "release_year"] += 2000
train.loc[(train['release_year'] > 19) & (train['release_year'] < 100), "release_year"] += 1900
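The same two-digit-year correction, demonstrated on a toy series of made-up dates in the dataset's m/d/yy format:

```python
import pandas as pd

dates = pd.Series(["2/20/15", "6/12/99", "10/10/71"])  # hypothetical values
parts = dates.str.split("/", expand=True).astype(int)
parts.columns = ["month", "day", "year"]

# Two-digit years: 00-19 belong to the 2000s, 20-99 to the 1900s
parts.loc[parts["year"] <= 19, "year"] += 2000
parts.loc[(parts["year"] > 19) & (parts["year"] < 100), "year"] += 1900
```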
Additional Features from Release Date
# Make new features from date, day of week and quarter...
releaseDate = pd.to_datetime(train['release_date'])
train['release_dayofweek'] = releaseDate.dt.dayofweek
train['release_quarter'] = releaseDate.dt.quarter
Plot release-year counts: a significant uptick in movie releases from the 1980s onward, which makes sense as society becomes more affluent, with noticeable dips due to economic conditions.
plt.figure(figsize=(20,10))
sns.countplot(train['release_year'].sort_values())
plt.title("Movie Release counts by year")
loc, labels = plt.xticks()
plt.xticks(rotation=90)
plt.show()
Plot release-month counts: shows the month-wise variation in release counts; interestingly, late summer and early fall have the maximum number of releases.
plt.figure(figsize=(20,10))
sns.countplot(train['release_month'].sort_values())
plt.title("Movie Release counts by Month")
loc, labels = plt.xticks()
loc, labels = loc, ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
plt.xticks(loc, labels)
plt.show()
Plot release-day counts: mostly stable, though the first and middle of the month see many releases. Does this match the twice-monthly pay pattern of salaried moviegoers? Maybe, but that would be a tangent for the problem at hand.
plt.figure(figsize=(20,10))
sns.countplot(train['release_day'].sort_values())
plt.title("Release Day Count")
plt.xticks()
plt.show()
Plot release day of week: this matches intuition perfectly; the majority of movies are released on Fridays, followed by Thursdays, as moviegoers are eager to spend some of their weekend on entertainment.
plt.figure(figsize=(20,10))
sns.countplot(train['release_dayofweek'].sort_values())
plt.title("Movies released on day of week")
loc, labels = plt.xticks()
loc, labels = loc, ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
plt.xticks(loc, labels)
plt.show()
Plot release-quarter counts: no surprises here; consistent with the monthly trends, summer and fall see more releases than the other two quarters.
plt.figure(figsize=(20,10))
sns.countplot(train['release_quarter'].sort_values())
plt.title("Movies released in quarter")
plt.show()
Plot release year vs. revenue: good, this matches intuition and the upward trend in release counts from the '80s, with valleys due to economic fluctuations.
meanRevenueByYear = train.groupby("release_year")["revenue"].mean()
meanRevenueByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Year")
plt.show()
Release month vs. revenue: an interesting observation here; mean revenue peaks in June and bottoms out in September (when the most movies are released). Supply and demand?!
meanRevenueByMonth = train.groupby("release_month")["revenue"].mean()
meanRevenueByMonth.plot(figsize=(20,10), color="r")
plt.xlabel("Release Month")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue Release Month")
plt.show()
Release day of week vs. revenue: the graph indicates that Wednesday releases earn the highest mean revenue, ahead of Friday and Sunday, even though the most movies are released on Friday followed by Thursday. This matters less for our analysis than it might suggest.
meanRevenueByDayOfWeek = train.groupby("release_dayofweek")["revenue"].mean()
meanRevenueByDayOfWeek.plot(figsize=(20,10), color="r")
plt.xlabel("Day of Week")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue by Day of Week")
plt.show()
Movie mean runtime by year: visualizing the average runtime, we can see that movies were very long in the early years, averaging around 150 minutes. Runtimes hit a minimum during the Great Recession and have since bounced around a little under the two-hour mark. It is interesting to note that the average peaked just after World War II and again after the Vietnam War, but exploring those aspects would be a tangent; runtime is more or less stable around two hours, so let us move on...
meanRuntimeByYear = train.groupby("release_year")["runtime"].mean()
meanRuntimeByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Runtime")
plt.title("Movie Mean Runtime by Year")
plt.show()
Mean popularity by year: a general uptick in movie popularity; no surprises, except a huge rise in 2016.
meanPopularityByYear = train.groupby("release_year")["popularity"].mean()
meanPopularityByYear.plot(figsize=(20,15), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Popularity")
plt.title("Movie Mean Popularity by Year")
plt.show()
Movie mean budget by year: a general uptick in the average budget spent on movies, which matches intuition given the increasing number of movies made over the years, and tracks economic trends as well.
meanBudgetByYear = train.groupby("release_year")["budget"].mean()
meanBudgetByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Budget")
plt.title("Movie Mean Budget by Year")
plt.show()
Count genres in the training dataset: no surprises here; moviegoers go to the movies to experience drama, and the data shows at least half of the movies carry the "Drama" genre, followed by roughly one third in "Comedy" and one fourth each in "Thriller" and "Action".
def get_dictionary(s):
    try:
        d = eval(s)
    except:
        d = {}
    return d
train['genres'] = train['genres'].map(lambda x: sorted([d['name'] for d in get_dictionary(x)])).map(lambda x: ','.join(map(str, x)))
genres = train.genres.str.get_dummies(sep=',')
train = pd.concat([train, genres], axis=1, sort=False)
for col in genres:
    print(col, "Genres Movie : ", genres[col].sum())
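The `str.get_dummies` step is doing the heavy lifting above; on a toy series it looks like this:

```python
import pandas as pd

# Comma-joined genre strings become one-hot indicator columns
genres = pd.Series(["Action,Comedy", "Comedy", "Drama"])
dummies = genres.str.get_dummies(sep=",")
```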
Count genres in the test dataset: a quick check for oddities in the genres column; it matches the overall trends, no issues here.
test['genres'] = test['genres'].map(lambda x: sorted([d['name'] for d in get_dictionary(x)])).map(lambda x: ','.join(map(str, x)))
genres = test.genres.str.get_dummies(sep=',')
test = pd.concat([test, genres], axis=1, sort=False)
for col in genres:
    print(col, "Genres Movie : ", genres[col].sum())
Original language counts: most movies, over 80%, are in English, probably due to the popularity of the English language and of Hollywood. It would be concerning to generalize this analysis and model to worldwide predictions, especially as movies become popular worldwide and other languages catch up quickly.
plt.figure(figsize=(20,10))
sns.countplot(train['original_language'].sort_values())
plt.title("Original Language Counts")
plt.show()
Status analysis: a quick peek at the counts of movie "status".
train['status'].value_counts()
Do we have revenue for 4 rumored movies? How is that possible for movies not yet released? Something to consider during feature engineering: how to handle these exceptions.
train.loc[train['status'] == "Rumored"][['status','revenue']]
How about in test dataset? 4389 movies released, 7 are yet to be released.
test['status'].value_counts()
Homepage analysis: how many movies have a homepage? This could be used as a binary feature.
train['has_homepage'] = 1
train.loc[pd.isnull(train['homepage']) ,"has_homepage"] = 0
plt.figure(figsize=(20,10))
sns.countplot(train['has_homepage'].sort_values())
plt.title("Has Homepage?")
plt.show()
Correlation between has_homepage and revenue: from the visualization, we can infer that movies with a homepage generally earn higher revenues; they are more popular.
sns.catplot(x="has_homepage", y="revenue", data=train)
plt.title("Revenue of movies with/without homepage")
Does the movie have a tagline, and how does it affect revenue? The analysis indicates that movies with a tagline are more popular and command more revenue.
train['isTaglineNA'] = 0
train.loc[pd.isnull(train['tagline']) ,"isTaglineNA"] = 1
sns.catplot(x="isTaglineNA", y="revenue", data=train)
plt.title('Revenue of movies with/without tagline');
Is the movie title different from its original title? Analyzing the effect with a categorical plot, we can conclude that movies that keep the same title command more revenue, probably the effect of a previous title's success carrying over to a new release under the same title.
train['isTitleDifferent'] = 1
train.loc[ train['original_title'] == train['title'] ,"isTitleDifferent"] = 0
sns.catplot(x="isTitleDifferent", y="revenue", data=train)
plt.title('Revenue of movies with single and multiple titles');
What is the effect on revenue of the original language being English? This agrees with the original-language analysis: most of our training dataset is English movies.
train['isOriginalLanguageEng'] = 0
train.loc[train['original_language'] == "en" ,"isOriginalLanguageEng"] = 1
sns.catplot(x="isOriginalLanguageEng", y="revenue", data=train)
plt.title('Revenue of movies when Original Language is English and Not English');
We also have an additional dataset with rating and total votes for each movie. It would be interesting to augment the given dataset with these features; it also gives me an opportunity to explore joins, handle records missing after joins, and decide how to impute missing values. Keep building skills!
trainAdditionalFeatures = pd.read_csv('data/TrainAdditionalFeatures.csv')
testAdditionalFeatures = pd.read_csv('data/TestAdditionalFeatures.csv')
train = pd.merge(train, trainAdditionalFeatures, how='left', on=['imdb_id'])
test = pd.merge(test, testAdditionalFeatures, how='left', on=['imdb_id'])
trainAdditionalFeatures.head()
Missing value analysis in additional datasets
train['rating'].isna().sum()
train['totalVotes'].isna().sum()
test['rating'].isna().sum()
test['totalVotes'].isna().sum()
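Those NaNs are left-join misses: movies with no row in the additional-features file. pandas can flag them explicitly with `indicator=True` (toy frames with hypothetical ids):

```python
import pandas as pd

movies = pd.DataFrame({"imdb_id": ["tt1", "tt2", "tt3"]})
extra = pd.DataFrame({"imdb_id": ["tt1", "tt3"], "rating": [7.2, 6.1]})

# how='left' keeps every movie; _merge marks rows without a match
merged = movies.merge(extra, how="left", on="imdb_id", indicator=True)
n_missing = (merged["_merge"] == "left_only").sum()
```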
Quite a few rows are missing the additional features. So let us fill them with reasonable defaults: 5 votes and a rating of 1.5, cautiously defaulting to low values.
train['rating'] = train['rating'].fillna(1.5)
train['totalVotes'] = train['totalVotes'].fillna(5)
test['rating'] = test['rating'].fillna(1.5)
test['totalVotes'] = test['totalVotes'].fillna(5)
Training-set rating visualization: apart from the spike caused by defaulting missing values, the data follows a roughly normal distribution. Great!
plt.figure(figsize=(20,10))
sns.countplot(train['rating'].sort_values())
plt.title("Training dataset Rating Count")
plt.show()
Test-set rating visualization: roughly follows the training-set trends; no concerns about using this as a feature.
plt.figure(figsize=(20,10))
sns.countplot(test['rating'].sort_values())
plt.title("Test dataset Rating Count")
plt.show()
Let us explore the significance of rating on mean revenue in the training dataset. A rating of 6 has peak revenue, followed by 7 and 8; this should be a good feature to predict on!
meanRevenueByRating = train.groupby("rating")["revenue"].mean()
meanRevenueByRating.plot(figsize=(20,10), color="r")
plt.xlabel("Rating")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Rating")
plt.show()
Mean revenue vs. total votes: movies with vote counts in the range of 900 to 1600 tend to command good revenues.
meanRevenueByTotalVotes = train.groupby("totalVotes")["revenue"].mean()
meanRevenueByTotalVotes.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(0,3000,1000))
plt.xlabel("Total Votes")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Total Votes")
plt.show()
Trends of total votes by release year follow the overall uptick in movie releases over the years, giving confidence that we can use this as a predictor feature.
meanTotalVotesByYear = train.groupby("release_year")["totalVotes"].mean()
meanTotalVotesByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("TotalVotes")
plt.title("Movie Mean Total Votes By Release Year")
plt.show()
Total Votes vs Rating Visualization to analyze how they could potentially be interrelated.
meanTotalVotesByRating = train.groupby("rating")["totalVotes"].mean()
meanTotalVotesByRating.plot(figsize=(20,10), color="r")
plt.xlabel("Rating")
plt.ylabel("Total Votes")
plt.title("Movie Mean Total Votes by Rating")
plt.show()
plt.figure(figsize = (12, 12))
text = ' '.join(train['original_title'].values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in titles')
plt.axis("off")
plt.show()
plt.figure(figsize = (12, 12))
text = ' '.join(train['overview'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in overview')
plt.axis("off")
plt.show()
plt.figure(figsize = (12, 12))
text = ' '.join(train['tagline'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in tagline')
plt.axis("off")
plt.show()
Let us draw a heat map for correlation visualization.
As expected, total votes and revenue show the highest correlation at 0.77, as highly voted movies perform well at the box office, followed by budget and revenue at 0.75, which also makes sense: highly popular movies that collect large revenues typically have huge budgets. Popularity takes the next spot at 0.46 correlation to revenue, then runtime at 0.22. It is surprising that rating comes in at only 0.17. Release year, release day of week, and release month show weak to very weak negative correlations.
train_viz = train[['budget','rating','totalVotes','popularity','runtime','release_year','release_month','release_dayofweek','revenue']]
f,ax = plt.subplots(figsize=(20, 10))
sns.heatmap(train_viz.corr(), annot=True)
plt.show()
With a thorough understanding of the data, including the additional features, let us tackle feature engineering. We could do this the modern ML-engineering way, breaking it into modules, building tests, etc.; however, I chose to do it within the notebook to stay focused on the project and finishing it on time, instead of getting sidetracked into beautification and production-grade hardening of the feature engineering pipeline.
Putting together the various data exploration activities and transformation snippets, including JSON handling, and modularizing where possible, I wrote a data_prep function.
def get_dictionary(s):
    try:
        d = eval(s)
    except:
        d = {}
    return d

def get_json_dict(df):
    global json_cols
    result = dict()
    for e_col in json_cols:
        d = dict()
        rows = df[e_col].values
        for row in rows:
            if row is None:
                continue
            for i in row:
                if i['name'] not in d:
                    d[i['name']] = 0
                d[i['name']] += 1
        result[e_col] = d
    return result
def data_prep(df):
    # release_date handling: split and fix two-digit years
    df[['release_month','release_day','release_year']] = df['release_date'].str.split('/', expand=True).replace(np.nan, 0).astype(int)
    df.loc[df['release_year'] <= 19, "release_year"] += 2000
    df.loc[(df['release_year'] > 19) & (df['release_year'] < 100), "release_year"] += 1900

    releaseDate = pd.to_datetime(df['release_date'])
    df['release_dayofweek'] = releaseDate.dt.dayofweek
    df['release_quarter'] = releaseDate.dt.quarter

    # Impute missing rating/totalVotes with the mean for the movie's
    # release year and original language, then fall back to low defaults
    df['rating'] = df['rating'].fillna(
        df.groupby(["release_year", "original_language"])['rating'].transform('mean'))
    df['totalVotes'] = df['totalVotes'].fillna(
        df.groupby(["release_year", "original_language"])['totalVotes'].transform('mean'))
    df['rating'] = df['rating'].fillna(1.5)
    df.loc[df['rating'] == 0, "rating"] = 1.5
    df['totalVotes'] = df['totalVotes'].fillna(5)
    df.loc[df['totalVotes'] == 0, "totalVotes"] = 5

    # IMDB-style weighted rating: shrink toward a prior mean of 6.367
    # weighted as 1000 pseudo-votes
    df['weightedRating'] = (df['rating'] * df['totalVotes'] + 6.367 * 1000) / (df['totalVotes'] + 1000)

    # Budget: adjust for inflation (simplistic 2.1% per year) and take log to reduce skewness
    df['originalBudget'] = df['budget']
    df['inflationBudget'] = df['budget'] + df['budget'] * 2.1 / 100 * (2018 - df['release_year'])
    df['budget'] = np.log1p(df['inflationBudget'])

    # Crew gender aggregations
    df['genders_0_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    df['genders_1_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    df['genders_2_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))

    df.loc[df['runtime'] == 0, "runtime"] = df['runtime'].mean()

    df['_collection_name'] = df['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
    le = LabelEncoder()
    le.fit(list(df['_collection_name'].fillna('')))
    df['_collection_name'] = le.transform(df['_collection_name'].fillna('').astype(str))
    df['_num_Keywords'] = df['Keywords'].apply(lambda x: len(x) if x != {} else 0)
    df['_num_cast'] = df['cast'].apply(lambda x: len(x) if x != {} else 0)

    # Ratio features
    df['_popularity_mean_year'] = df['popularity'] / df.groupby("release_year")["popularity"].transform('mean')
    df['_budget_runtime_ratio'] = df['budget'] / df['runtime']
    df['_budget_popularity_ratio'] = df['budget'] / df['popularity']
    df['_budget_year_ratio'] = df['budget'] / df['release_year']
    df['_releaseYear_popularity_ratio'] = df['release_year'] / df['popularity']
    df['_releaseYear_popularity_ratio2'] = df['popularity'] / df['release_year']
    df['_popularity_totalVotes_ratio'] = df['totalVotes'] / df['popularity']
    df['_rating_popularity_ratio'] = df['rating'] / df['popularity']
    df['_rating_totalVotes_ratio'] = df['totalVotes'] / df['rating']
    df['_totalVotes_releaseYear_ratio'] = df['totalVotes'] / df['release_year']
    df['_budget_rating_ratio'] = df['budget'] / df['rating']
    df['_runtime_rating_ratio'] = df['runtime'] / df['rating']
    df['_budget_totalVotes_ratio'] = df['budget'] / df['totalVotes']

    # Missing-value flags
    df['has_homepage'] = 1
    df.loc[pd.isnull(df['homepage']), "has_homepage"] = 0
    df['isbelongs_to_collectionNA'] = 0
    df.loc[pd.isnull(df['belongs_to_collection']), "isbelongs_to_collectionNA"] = 1
    df['isTaglineNA'] = 0
    df.loc[pd.isnull(df['tagline']), "isTaglineNA"] = 1

    # Flag for original language being English
    df['isOriginalLanguageEng'] = 0
    df.loc[df['original_language'] == "en", "isOriginalLanguageEng"] = 1
    # Flag for title change between original and current title
    df['isTitleDifferent'] = 1
    df.loc[df['original_title'] == df['title'], "isTitleDifferent"] = 0
    # Flag for movie release status
    df['isMovieReleased'] = 1
    df.loc[df['status'] != "Released", "isMovieReleased"] = 0

    # Extract collection id from belongs_to_collection
    df['collection_id'] = df['belongs_to_collection'].apply(lambda x: np.nan if len(x) == 0 else x[0]['id'])

    # Letter/word counts and similar conversions
    df['original_title_letter_count'] = df['original_title'].str.len()
    df['original_title_word_count'] = df['original_title'].str.split().str.len()
    df['title_word_count'] = df['title'].str.split().str.len()
    df['overview_word_count'] = df['overview'].str.split().str.len()
    df['tagline_word_count'] = df['tagline'].str.split().str.len()
    df['production_countries_count'] = df['production_countries'].apply(lambda x: len(x))
    df['production_companies_count'] = df['production_companies'].apply(lambda x: len(x))
    df['cast_count'] = df['cast'].apply(lambda x: len(x))
    df['crew_count'] = df['crew'].apply(lambda x: len(x))

    # Per-row group aggregates by release year and by rating
    df['meanruntimeByYear'] = df.groupby("release_year")["runtime"].transform('mean')
    df['meanPopularityByYear'] = df.groupby("release_year")["popularity"].transform('mean')
    df['meanBudgetByYear'] = df.groupby("release_year")["budget"].transform('mean')
    df['meantotalVotesByYear'] = df.groupby("release_year")["totalVotes"].transform('mean')
    df['meanTotalVotesByRating'] = df.groupby("rating")["totalVotes"].transform('mean')
    df['medianBudgetByYear'] = df.groupby("release_year")["budget"].transform('median')

    # One-hot encode JSON list columns, bucketing rare values as <col>_etc
    for col in ['genres', 'production_countries', 'spoken_languages', 'production_companies']:
        df[col] = df[col].map(lambda x: sorted(list(set([n if n in train_dict[col] else col + '_etc' for n in [d['name'] for d in x]])))).map(lambda x: ','.join(map(str, x)))
        temp = df[col].str.get_dummies(sep=',')
        df = pd.concat([df, temp], axis=1, sort=False)
    df.drop(['genres_etc'], axis=1, inplace=True)

    df = df.drop(['id', 'revenue', 'belongs_to_collection', 'genres', 'homepage', 'imdb_id', 'overview', 'runtime',
                  'poster_path', 'production_companies', 'production_countries', 'release_date', 'spoken_languages',
                  'status', 'title', 'Keywords', 'cast', 'crew', 'original_language', 'original_title', 'tagline',
                  'collection_id'], axis=1)
    df.fillna(value=0.0, inplace=True)
    return df
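One piece of data_prep worth calling out is weightedRating: an IMDB-style Bayesian average that shrinks each movie's rating toward a prior mean (6.367 here) weighted as 1000 pseudo-votes, so sparsely voted movies move the most. A standalone sketch of the same formula:

```python
def weighted_rating(rating, votes, prior_mean=6.367, prior_votes=1000):
    """Bayesian average: blend the observed rating with a global prior."""
    return (rating * votes + prior_mean * prior_votes) / (votes + prior_votes)

# A movie with few votes is pulled strongly toward the prior
few = weighted_rating(9.0, votes=10)
many = weighted_rating(9.0, votes=100_000)
```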
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
test['revenue'] = np.nan
train = pd.merge(train, pd.read_csv('data/TrainAdditionalFeatures.csv'), how='left', on=['imdb_id'])
test = pd.merge(test, pd.read_csv('data/TestAdditionalFeatures.csv'), how='left', on=['imdb_id'])
Quick sanity check on columns read and shape of dataframes
print(train.columns)
print(train.shape)
print(test.columns)
print(test.shape)
Handle the JSON expansions, dropping low-frequency values to keep the feature matrix manageable.
train['revenue']= np.log1p(train['revenue'])
y = train['revenue'].values
json_cols = ['genres', 'belongs_to_collection', 'production_companies', 'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']
for col in tqdm(json_cols):
    train[col] = train[col].apply(lambda x: get_dictionary(x))
    test[col] = test[col].apply(lambda x: get_dictionary(x))
train_dict = get_json_dict(train)
test_dict = get_json_dict(test)
# remove categories with bias and low frequency
for col in json_cols:
    remove = []
    train_id = set(list(train_dict[col].keys()))
    test_id = set(list(test_dict[col].keys()))
    remove += list(train_id - test_id) + list(test_id - train_id)
    for i in train_id.union(test_id) - set(remove):
        if train_dict[col][i] < 10 or i == '':
            remove += [i]
    for i in remove:
        if i in train_dict[col]:
            del train_dict[col][i]
        if i in test_dict[col]:
            del test_dict[col][i]
Merge both datasets, apply common feature engineering processing.
all_data = data_prep(pd.concat([train, test]).reset_index(drop = True))
train = all_data.loc[:train.shape[0] - 1,:]
test = all_data.loc[train.shape[0]:,:]
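The concat-then-split-by-position pattern above keeps train and test encoded identically; a toy round trip with made-up frames:

```python
import pandas as pd

tr = pd.DataFrame({"x": [1, 2, 3]})
te = pd.DataFrame({"x": [4, 5]})

# Transform once on the combined frame, then split back by row position
all_rows = pd.concat([tr, te]).reset_index(drop=True)
tr2 = all_rows.loc[: tr.shape[0] - 1, :]   # .loc slicing is inclusive
te2 = all_rows.loc[tr.shape[0]:, :]
```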
I decided to use a simple DecisionTreeRegressor with default arguments as the benchmark model. Other models' performance will be compared against this baseline on the chosen metric, minimizing the RMSLE score for this regression problem.
Let us split the data into train and validation sets with an 80/20 ratio.
x_train, x_valid, y_train, y_valid = train_test_split(train,
y,
test_size=0.2,
random_state = 7)
# Show the results of the split
print("Training set has {} samples.".format(x_train.shape[0]))
print("validation set has {} samples.".format(x_valid.shape[0]))
For the benchmark model, RMSE comes in around 3.19, which serves as the benchmark for comparing the prediction performance of further models.
# Fit regression model
regr = DecisionTreeRegressor()
regr.fit(x_train, y_train)
# Predict
y_val_pred = regr.predict(x_valid)
# RMSLE on validation set
print("Root Mean Squared Log Error for Benchmark Model: {}".format(np.sqrt(mean_squared_log_error(y_valid, y_val_pred))))
print("Root Mean Squared Error for Benchmark Model: {}".format(np.sqrt(mean_squared_error(y_valid, y_val_pred))))
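To make the metric concrete: RMSLE is just the RMSE of log1p-transformed values, which is why training on logRevenue with an RMSE objective targets the competition metric. A quick check with toy numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0, 1000.0, 0.0])
y_pred = np.array([120.0, 900.0, 10.0])

# RMSLE computed by hand vs. via sklearn
manual = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
via_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))
```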
I tweaked the params a few times to get an optimal RMSE, and visualized the results with the eli5 and shap modules.
params = {'num_leaves': 30,
'min_data_in_leaf': 20,
'objective': 'regression',
'max_depth': 5,
'learning_rate': 0.01,
"boosting": "gbdt",
"feature_fraction": 0.9,
"bagging_freq": 1,
"bagging_fraction": 0.9,
"bagging_seed": 11,
"metric": 'rmse',
"lambda_l1": 0.2,
"verbosity": -1}
model1 = lgb.LGBMRegressor(**params, n_estimators = 20000, nthread = 4, n_jobs = -1)
model1.fit(x_train, y_train,
eval_set=[(x_train, y_train), (x_valid, y_valid)], eval_metric='rmse',
verbose=1000, early_stopping_rounds=200)
eli5.show_weights(model1, feature_filter=lambda x: x != '<BIAS>')
explainer = shap.TreeExplainer(model1, x_train)
shap_values = explainer.shap_values(x_train)
shap.summary_plot(shap_values, x_train)
top_columns = x_train.columns[np.argsort(shap_values.std(0))[::-1]][:10]
for col in top_columns:
    shap.dependence_plot(col, shap_values, x_train)
For actual training, I intend to use XGBoost, LightGBM, and CatBoost models with k-fold cross-validation using 6 splits (k=6), and will use a blended error score.
Setting up 6-fold cross-validation with shuffling.
random_seed = 200
n_fold = 6
folds = KFold(n_splits=n_fold, shuffle=True, random_state= random_seed)
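With shuffle=True and a fixed random_state the folds are reproducible, and every row lands in exactly one validation fold. A toy check:

```python
import numpy as np
from sklearn.model_selection import KFold

toy_folds = KFold(n_splits=6, shuffle=True, random_state=200)
X_toy = np.arange(60).reshape(-1, 1)

# Each of the 6 validation folds holds 1/6 of the rows, with no overlap
val_indices = [val_idx for _, val_idx in toy_folds.split(X_toy)]
val_sizes = [len(v) for v in val_indices]
```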
def train_model(X, X_test, y, params=None, folds=folds, model_type='lgb', plot_feature_importance=False, model=None):
    oof = np.zeros(X.shape[0])
    prediction = np.zeros(X_test.shape[0])
    scores = []
    feature_importance = pd.DataFrame()
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
        print('Fold', fold_n, 'started at', time.ctime())
        X_train, X_valid = X.values[train_index], X.values[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        if model_type == 'lgb':
            model = lgb.LGBMRegressor(**params, n_estimators=20000, nthread=4, n_jobs=-1)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',
                      verbose=1000, early_stopping_rounds=200)
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test, num_iteration=model.best_iteration_)
        if model_type == 'xgb':
            train_data = xgb.DMatrix(data=X_train, label=y_train)
            valid_data = xgb.DMatrix(data=X_valid, label=y_valid)
            watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
            model = xgb.train(dtrain=train_data, num_boost_round=20000, evals=watchlist,
                              early_stopping_rounds=200, verbose_eval=500, params=params)
            y_pred_valid = model.predict(xgb.DMatrix(X_valid), ntree_limit=model.best_ntree_limit)
            y_pred = model.predict(xgb.DMatrix(X_test.values), ntree_limit=model.best_ntree_limit)
        if model_type == 'cat':
            model = CatBoostRegressor(iterations=20000, eval_metric='RMSE', **params)
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True, verbose=False)
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test)
        oof[valid_index] = y_pred_valid.reshape(-1,)
        scores.append(mean_squared_error(y_valid, y_pred_valid) ** 0.5)
        prediction += y_pred
        if model_type == 'lgb':
            # accumulate per-fold feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = X.columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
    prediction /= n_fold
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
    if model_type == 'lgb':
        feature_importance["importance"] /= n_fold
        if plot_feature_importance:
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
            return oof, prediction, feature_importance
        return oof, prediction
    else:
        return oof, prediction
params = {'objective':'regression',
'num_leaves' : 30,
'min_data_in_leaf' : 20,
'max_depth' : 9,
'learning_rate': 0.004,
'min_child_samples':100,
'feature_fraction':0.9,
"bagging_freq": 1,
"bagging_fraction": 0.9,
'lambda_l1': 0.2,
"bagging_seed": random_seed,
"metric": 'rmse',
'subsample':.8,
'colsample_bytree':.9,
"random_state" : random_seed,
"verbosity": -1}
oof_lgb, prediction_lgb, _ = train_model(train, test, y, params=params, model_type='lgb',
plot_feature_importance=True)
xgb_params = {'eta': 0.01,
'objective': 'reg:linear',
'max_depth': 6,
'subsample': 0.6,
'colsample_bytree': 0.7,
'eval_metric': 'rmse',
'seed': 25,
'silent': True}
oof_xgb, prediction_xgb = train_model(train, test, y, params=xgb_params, model_type='xgb')
cat_params = {'learning_rate': 0.004,
'depth': 5,
'l2_leaf_reg': 10,
'colsample_bylevel': 0.8,
'bagging_temperature': 0.2,
'od_type': 'Iter',
'od_wait': 100,
'random_seed': random_seed,
'allow_writing_files': False}
oof_cat, prediction_cat = train_model(train, test, y, params=cat_params, model_type='cat')
train_stack = np.vstack([oof_lgb, oof_xgb, oof_cat]).transpose()
train_stack = pd.DataFrame(train_stack, columns=['lgb', 'xgb', 'cat'])
test_stack = np.vstack([prediction_lgb, prediction_xgb, prediction_cat]).transpose()
test_stack = pd.DataFrame(test_stack, columns=['lgb', 'xgb', 'cat'])
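The stacking matrices above follow a simple recipe: each base model's out-of-fold predictions become one column of the level-2 training set. A toy version with made-up numbers:

```python
import numpy as np
import pandas as pd

# Out-of-fold predictions from two hypothetical base models
oof_a = np.array([1.0, 2.0, 3.0])
oof_b = np.array([1.1, 1.9, 3.2])

# One row per training example, one column per base model
stack = pd.DataFrame(np.vstack([oof_a, oof_b]).T, columns=["a", "b"])
```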
params = {'num_leaves': 8,
'min_data_in_leaf': 20,
'objective': 'regression',
'max_depth': 3,
'learning_rate': 0.01,
"boosting": "gbdt",
"bagging_seed": 11,
"metric": 'rmse',
"lambda_l1": 0.2,
"verbosity": -1}
oof_lgb_stack, prediction_lgb_stack, _ = train_model(train_stack, test_stack, y,
params=params, model_type='lgb', plot_feature_importance=True)
Now, with the three chosen modeling techniques and the stacked model trained, let us tackle predictions on the test dataset, see how closely they match one another, and pick the final model.
sub = pd.read_csv('data/sample_submission.csv')
all_pred = pd.read_csv('data/sample_submission.csv')
sub['revenue'] = np.expm1(prediction_lgb)
sub.to_csv("output/lgb.csv", index=False)
all_pred['pred_lgb'] = np.expm1(prediction_lgb)
sub['revenue'] = np.expm1(prediction_xgb)
sub.to_csv("output/xgb.csv", index=False)
all_pred['pred_xgb'] = np.expm1(prediction_xgb)
sub['revenue'] = np.expm1(prediction_cat)
sub.to_csv("output/cat.csv", index=False)
all_pred['pred_cat'] = np.expm1(prediction_cat)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
sub.to_csv("output/blend_lgb_xgb.csv", index=False)
all_pred['pred_blend_lgb_xgb'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)
sub.to_csv("output/blend_all3.csv", index=False)
all_pred['pred_blend_all3'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)
sub['revenue'] = np.expm1(prediction_lgb_stack)
sub.to_csv("output/stack_lgb.csv", index=False)
all_pred['pred_stack_lgb'] = np.expm1(prediction_lgb_stack)
all_pred.info()
Visualizing the heat map, the blend of all 3 models (pred_blend_all3) shows very high correlation with the others: essentially perfect correlation with two of the models and 0.99 with the rest. I am going to pick that as my final model.
pred_viz = all_pred[['pred_lgb','pred_xgb','pred_cat','pred_blend_lgb_xgb','pred_blend_all3','pred_stack_lgb']]
f,ax = plt.subplots(figsize=(20, 10))
sns.heatmap(pred_viz.corr(), annot=True)
plt.show()
pd.plotting.scatter_matrix(pred_viz, alpha = 0.5, figsize = (14,8), diagonal = 'kde');